Sains Malaysiana 52(9)(2023): 2725-2732
http://doi.org/10.17576/jsm-2023-5209-20
Statistical Methods for Finding Outliers in Multivariate Data using a Boxplot and Multiple Linear Regression
THEERAPHAT THANWISET & WUTTICHAI SRISODAPHOL*
Department of Statistics, Khon Kaen University, 40002 Khon Kaen, Thailand
Received: 1 December 2022 / Accepted: 15 August 2023
Abstract
The objective of this study was to propose a method for detecting outliers in multivariate data based on a boxplot and multiple linear regression. In the proposed method, the boxplot is first applied to each variable to split the data set into two parts: normal data (observations falling within the lower and upper fences of the boxplot) and data that may be outliers. The normal data are then used to fit a multiple linear regression model, and the maximum absolute residual from that fit is taken as the cut-off point. To evaluate the performance of the proposed method, a simulation study was conducted on multivariate normal data with and without contamination at various levels. The proposed method was compared with previous methods, namely the Mahalanobis distance and the Mahalanobis distance with robust estimators based on the minimum volume ellipsoid, minimum covariance determinant, and minimum vector variance methods. The results showed that the proposed method outperformed the compared methods at all contamination levels. When applied to real data, it was also able to identify outliers consistent with the actual data.
Keywords:
Boxplot; multivariate data; multiple linear regression; outlier
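The procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Tukey fence constant k = 1.5, ordinary least squares as the regression fit, and the maximum absolute residual of the filtered data as the cut-off are all assumptions made for the sketch.

```python
import numpy as np

def boxplot_fences(x, k=1.5):
    """Tukey boxplot fences for one variable: (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def detect_outliers(X, y):
    """Flag observations as outliers via boxplot filtering + regression cutoff.

    Step 1: keep only rows lying within the boxplot fences of every variable
            (the "normal" data).
    Step 2: fit a multiple linear regression of y on X using the normal data.
    Step 3: take the maximum absolute residual of the normal data as the
            cut-off; flag any observation whose residual exceeds it.
    """
    data = np.column_stack([X, y])
    inside = np.ones(len(data), dtype=bool)
    for j in range(data.shape[1]):
        lo, hi = boxplot_fences(data[:, j])
        inside &= (data[:, j] >= lo) & (data[:, j] <= hi)

    # Fit OLS on the normal subset only (intercept column prepended)
    A_in = np.column_stack([np.ones(inside.sum()), X[inside]])
    beta, *_ = np.linalg.lstsq(A_in, y[inside], rcond=None)
    cutoff = np.abs(y[inside] - A_in @ beta).max()  # max absolute residual

    # Score every observation against the model fitted on normal data
    A_all = np.column_stack([np.ones(len(y)), X])
    resid = np.abs(y - A_all @ beta)
    return resid > cutoff
```

Because the regression is fitted only on the boxplot-filtered rows, gross outliers cannot inflate the coefficient estimates or the residual cut-off, which is what distinguishes this scheme from applying a residual rule to a fit on the full data.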
REFERENCES
Aelst, S.V. & Rousseeuw, P. 2009. Minimum volume ellipsoid. WIREs Computational Statistics 1: 71-82.
Anscombe, F.J. & Guttman, I. 1960. Rejection of outliers. Technometrics 2(2): 123-147.
Belsley, D.A., Kuh, E. & Welsch, R.E. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley & Sons.
Cabana, E., Lillo, R.E. & Laniado, H. 2021. Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators. Statistical Papers 62: 1583-1609.
Cook, R.D. 1977. Detection of influential observations in regression. Technometrics 19: 15-18.
Herdiani, E.T., Sari, P.P. & Sunusi, N. 2019. Detection of outliers in multivariate data using minimum vector variance method. Journal of Physics: Conference Series 1341(9): 092004.
Hoaglin, D.C. & Welsch, R.E. 1978. The hat matrix in regression and ANOVA. The American Statistician 32: 17-22.
Hubert, M. & Debruyne, M. 2010. Minimum covariance determinant. WIREs Computational Statistics 2: 36-43.
Lichtinghagen, R., Klawonn, F. & Hoffmann, G. 2020. UCI Machine Learning Repository. Irvine: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml/datasets/HCV+data
Mahalanobis, P.C. 1936. On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India 2(1): 49-55.
Montgomery, D.C., Peck, E.A. & Vining, G.G. 2012. Introduction to Linear Regression Analysis. 3rd ed. New York: John Wiley & Sons.
Tukey, J.W. 1977. Exploratory Data Analysis. Massachusetts: Addison Wesley.
*Corresponding author; email: wuttsr@kku.ac.th